Impact of Similarity Measures on Web-page Clustering
نویسندگان
چکیده
Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively di erent metrics. We observe that in domains such as Yahoo that provide a categorization by human experts, a useful criteria for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized kmeans, weighted graph partitioning), on high dimensional sparse data representing web documents. Performance is measured against a human-imposed classi cation into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical signi cance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
منابع مشابه
Seqdbscan : a New Sequence Dbscan Algorithm for Clustering of Web Usage Data
Web is a vast area for data mining research. It is used in finding the user access patterns from web access log. User page visits are sequential in nature In this paper,I proposed a new clustering algorithm, SeqDBSCAN for clustering sequential data.. we adopted a similarity preserving function called sequence and set similarity measure SM that captures both the order of occurrence of page visit...
متن کاملAn Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
متن کاملA Web Search Engine-Based Approach to Measure Semantic Similarity between Words
easuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an ...
متن کاملA Web Search Engine-based Approach to Measure Semantic Similarity between Words
Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an...
متن کاملAn adaptive neural network approach to hypertext clustering
The WWW is an on-line hypertextual collection, and a more sophisticated algorithm for Web page clustering may have to be based on combined term-similarity and hyperlink-similarity measures. It has been observed that nearly all currently employed techniques for document classification on the Web make use of textual information only. In addition, most of these techniques are incapable of discover...
متن کامل